Simply put, we took a rough data set, cleaned it, and analysed it to communicate some interesting observations and predictions regarding Airbnb’s presence in Brussels. We started by converting variables to formats we could work with. Then we selected variables from the raw data set based on their usefulness for regressions, data visualisations, correlation tests, and so on. Using these variables, we plotted bar charts, distribution charts, and more. Lastly, we ran regressions and made some predictions.
We approached the data visualisation with the objective of challenging preconceived notions we had about the relationship between customer and seller in the Airbnb context. Through some qualitative analysis, we answered questions such as “What is the relationship between the quality of a host (seller) and the price they charge?” and compared the results to our hypotheses.
For the regression part, we started by creating a new variable called price_4_nights to calculate the cost of staying 4 nights at an Airbnb. Because we were looking specifically at the cost for 2 people, we kept only the listings that can accommodate at least 2 people. For the regression models, however, we use log_price_4_nights as the response variable, since its distribution is close to normal. Before fitting the regression models, we split the full data set into a training set and a testing set.
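The preparation steps described above can be sketched roughly as follows. This is a minimal illustration on a small made-up data frame, not the code used for the report: the toy values, the seed, the 75/25 split, and the assumption that the 4-night cost is simply four times the nightly price are all placeholders.

```r
# Toy stand-in for the cleaned listings data (values are made up)
toy <- data.frame(
  price        = c(90, 74, 95, 60, 120, 45, 80, 150),
  accommodates = c(5, 4, 2, 1, 3, 2, 6, 2)
)

# Keep listings that can host at least 2 people, then build the target.
# Assumption: cost of 4 nights = 4 * nightly price (no extra fees).
toy <- subset(toy, accommodates >= 2)
toy$price_4_nights     <- 4 * toy$price
toy$log_price_4_nights <- log(toy$price_4_nights)

# Random train/test split (75% / 25% is an assumed proportion)
set.seed(1234)
train_idx <- sample(nrow(toy), size = floor(0.75 * nrow(toy)))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]
```

Models are then fitted on `train` only, and `test` is kept aside for the RMSE checks.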
Model 1
For Model 1, we tested the significance of property type, number of reviews and review score rating for the price of an Airbnb. At first glance, there is a negative relationship between review score rating and the price for 4 nights, which seems strange, since we would normally expect properties with higher ratings to charge higher prices. However, the coefficient is very close to zero and is not statistically significant. The other variables are significant. prop_type_simplified is a categorical variable, so the first thing to understand is that the regression uses entire condo as the baseline. The intercept means that an entire condominium (condo) commands a log_price_4_nights of 5.883 when the other predictors are zero. If another property type is chosen, such as a private room in a rental unit or a private room in a residential home, the log price decreases by 0.563 and 0.430 respectively. This makes sense, as renting a room costs less than renting an entire condo. In general, property type is a significant predictor of the price of an Airbnb. Collinearity is not an issue in this model, since every VIF is below 5. Running Model 1 on the testing set gives RMSE = 0.518.
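Because the response is on the log scale, the coefficients are easier to read once they are converted to percentage changes. The sketch below only re-uses the three numbers quoted in the paragraph above; the object names are made up for illustration.

```r
# Coefficients quoted in the Model 1 discussion (log scale)
intercept    <- 5.883
coef_private <- c(rental_unit = -0.563, residential_home = -0.430)

# exp(beta) - 1 gives the proportional change relative to the baseline
pct_change <- 100 * (exp(coef_private) - 1)
round(pct_change, 1)

# Back-transformed baseline, holding the numeric predictors at zero
baseline_price <- exp(intercept)
round(baseline_price)
```

Read this way, a private room in a rental unit is priced roughly 43% below an entire condo, and a private room in a residential home roughly 35% below, all else being equal.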
Model 2
We found review_scores_rating to be insignificant, so we drop it from the following regression. For Model 2, we want to determine whether room type is a significant predictor of the cost for 4 nights. We find that every room type except hotel room is a significant predictor of price. Again, collinearity is not an issue in this model, since every VIF is below 5. Checking for overfitting, we find RMSE = 0.504 on the testing set.
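The overfitting check amounts to computing the root mean squared error on data the model has not seen. A minimal sketch, assuming a simple hold-out split and using simulated data rather than the listings themselves (the `rmse` helper and all variable names are hypothetical):

```r
# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# Simulated data standing in for the listings (not the real data set)
set.seed(42)
n <- 200
sim <- data.frame(x = runif(n))
sim$log_price <- 5 + 0.8 * sim$x + rnorm(n, sd = 0.4)

train <- sim[1:150, ]
test  <- sim[151:200, ]

fit <- lm(log_price ~ x, data = train)

# A test RMSE close to the training RMSE suggests little overfitting
rmse_train <- rmse(train$log_price, predict(fit, newdata = train))
rmse_test  <- rmse(test$log_price,  predict(fit, newdata = test))
c(train = rmse_train, test = rmse_test)
```

Since the response is a log price, the RMSE is also on the log scale.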
Model 3
For Model 3, we want to determine whether the number of bathrooms, bedrooms and beds and the size of the house are significant predictors of the cost for 4 nights. The number of beds is not a significant predictor of log_price_4_nights. However, the number of bedrooms, the number of bathrooms and the size of the house are significant predictors. Given that every VIF is below 5, multicollinearity does not seem to be an issue. Checking for overfitting, we find RMSE = 0.441 on the testing set.
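For numeric predictors, the variance inflation factor can be computed by hand as 1 / (1 - R²), where R² comes from regressing one predictor on all the others. The sketch below illustrates that definition on simulated data; `vif_manual` is a hypothetical helper, not the function used to produce the VIFs in this report.

```r
# VIF for each predictor: 1 / (1 - R^2) from regressing it on the others
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    fit <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Simulated predictors (not the real listings)
set.seed(7)
n <- 300
bedrooms  <- rpois(n, 1) + 1
beds      <- bedrooms + rpois(n, 0.5)  # correlated with bedrooms by design
bathrooms <- 1 + rbinom(n, 2, 0.2)
X <- data.frame(bedrooms, beds, bathrooms)

vifs <- vif_manual(X)
round(vifs, 2)
```

A value of 1 means a predictor is uncorrelated with the others; the threshold of 5 used throughout this section is a common rule of thumb.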
Model 4
For Model 4, we want to understand whether superhosts (host_is_superhost) command a pricing premium after controlling for other variables. At first glance, being a superhost seems to command a pricing premium. However, the effect is not statistically significant, so at the 5% significance level we cannot conclude that superhosts command a pricing premium. Given that every VIF is below 5, multicollinearity does not seem to be an issue. We find RMSE = 0.418 on the testing set.
Model 5
host_is_superhost is not significant, so we do not include it in this regression. For Model 5, we want to see whether listings that can be booked instantly command a price premium over those that cannot. We find that being instantly bookable is a significant predictor of price. Given that every VIF is below 5, multicollinearity does not seem to be an issue. Checking on the testing set, we find RMSE = 0.441.
Model 6
For Model 6, we created a new variable called neighbourhood_simplified, which groups the 19 neighbourhoods of Brussels into 5 areas based on where they are located in the city: North West, North East, East/Centre, West/Centre and South/Centre. Location is a significant predictor of log_price_4_nights, as shown by the t-statistics. Being located in the East has no significant effect on price, whereas being located in the North East, North West or South has a significant positive effect on price_4_nights. Again, given that every VIF is below 5, multicollinearity does not seem to be an issue. Checking on the testing set, we find RMSE = 0.441.
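One simple way to build such a grouping is a named lookup vector. The sketch below covers only a handful of communes, and which commune belongs to which area is an assumption made for illustration; the actual assignment used for Model 6 is not listed in this section.

```r
# Hypothetical lookup: which commune goes in which area is an assumption
area_lookup <- c(
  "Jette"                = "North West",
  "Ganshoren"            = "North West",
  "Molenbeek-Saint-Jean" = "North West",
  "Schaerbeek"           = "North East",
  "Evere"                = "North East",
  "Bruxelles"            = "East/Centre"
)

# A few example values of neighbourhood_cleansed
neighbourhood_cleansed <- c("Molenbeek-Saint-Jean", "Schaerbeek",
                            "Jette", "Bruxelles")

# Named-vector indexing maps each commune to its area
neighbourhood_simplified <- factor(unname(area_lookup[neighbourhood_cleansed]))
table(neighbourhood_simplified)
```

Any commune missing from the lookup would come out as NA, which makes gaps in the mapping easy to spot.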
Model 7
For Model 7, we look at the effect of availability_30 and reviews_per_month on log_price_4_nights after controlling for other variables. In this model number_of_reviews is not significant, so we replace it with review_scores_rating, which is significant. This might be because reviews_per_month already captures much of the information in number_of_reviews, leaving that variable insignificant. We also find that availability_30 and reviews_per_month have a significant positive effect on price_4_nights. Again, given that every VIF is below 5, multicollinearity does not seem to be an issue. Checking on the testing set, we find RMSE = 0.3703.
Choosing a model
Model 7 has the highest adjusted R^2 and the lowest RMSE on the testing set, which means it has the best explanatory power with no sign of overfitting. We therefore use Model 7 for prediction.
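The RMSE side of this comparison can be laid out in a few lines, using the test-set values reported for each model above. The adjusted R^2 values are not quoted in this section, so they are left out of the sketch.

```r
# Test-set RMSE values as reported for each model above
model_rmse <- c(model1 = 0.518, model2 = 0.504, model3 = 0.441,
                model4 = 0.418, model5 = 0.441, model6 = 0.441,
                model7 = 0.3703)

# Rank models from lowest to highest test RMSE
sort(model_rmse)

# Name of the model with the smallest test error
best <- names(which.min(model_rmse))
best
```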
Prediction
Suppose we want to book a private room in a rental unit located in the North West, with more than 10 reviews and an average rating above 4.5. Based on the existing data set, our point estimate of the price for 4 nights is 123.4 euros, with a 95% interval from 115.7 euros to 131.6 euros.
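A quick sanity check on these figures: if the interval was computed on the log scale and then exponentiated (an assumption, though the numbers are consistent with it), the point estimate should sit at the midpoint of the logged bounds rather than at their arithmetic mean.

```r
# Interval bounds quoted above, in euros
lower <- 115.7
upper <- 131.6

# On the log scale the interval is symmetric around the point estimate,
# so exponentiating the midpoint of the logs should recover it
point_from_bounds <- exp(mean(log(c(lower, upper))))
round(point_from_bounds, 1)

# The arithmetic midpoint is slightly higher, for comparison
mean(c(lower, upper))
```

The log-scale midpoint comes out at about 123.4 euros, matching the reported point estimate.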
glimpse(listings)
Rows: 5,442
Columns: 74
$ id <dbl> 2352, 2354, 45145, 48180,~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-25, 2021-09-25, ~
$ name <chr> "Triplex-2chmbrs,grande s~
$ description <chr> "Cute 2 bedrooms appartme~
$ neighborhood_overview <chr> "Basilique Koekelberg, Ch~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 2582, 2582, 199370, 21956~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Oda", "Oda", "Erick", "A~
$ host_since <date> 2008-08-28, 2008-08-28, ~
$ host_location <chr> "Belgium", "Belgium", "Br~
$ host_about <chr> "Hi there! I've been a ho~
$ host_response_time <chr> "within an hour", "within~
$ host_response_rate <chr> "100%", "100%", "N/A", "N~
$ host_acceptance_rate <chr> "100%", "100%", "N/A", "N~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, FALS~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Molenbeek-Saint-Jean", "~
$ host_listings_count <dbl> 3, 3, 2, 1, 1, 13, 13, 13~
$ host_total_listings_count <dbl> 3, 3, 2, 1, 1, 13, 13, 13~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, FALSE, TRUE, FALSE~
$ neighbourhood <chr> "Sint-Jans-Molenbeek, Bru~
$ neighbourhood_cleansed <chr> "Molenbeek-Saint-Jean", "~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude <dbl> 50.85702, 50.85709, 50.85~
$ longitude <dbl> 4.30771, 4.30757, 4.36809~
$ property_type <chr> "Entire rental unit", "En~
$ room_type <chr> "Entire home/apt", "Entir~
$ accommodates <dbl> 5, 4, 2, 2, 3, 3, 3, 3, 6~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "1 bath", "1 bath", "1 ba~
$ bedrooms <dbl> 2, 1, 1, 2, 1, NA, NA, NA~
$ beds <dbl> 2, 1, 1, 2, 1, 2, 2, 2, 4~
$ amenities <chr> "[\"Baby bath\", \"Luggag~
$ price <chr> "$90.00", "$74.00", "$95.~
$ minimum_nights <dbl> 2, 2, 1, 2, 5, 1, 1, 1, 1~
$ maximum_nights <dbl> 365, 365, 1125, 14, 120, ~
$ minimum_minimum_nights <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ maximum_minimum_nights <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ minimum_maximum_nights <dbl> 1125, 1125, 1125, 14, 120~
$ maximum_maximum_nights <dbl> 1125, 1125, 1125, 14, 120~
$ minimum_nights_avg_ntm <dbl> 2, 2, 2, 2, 5, 1, 1, 1, 1~
$ maximum_nights_avg_ntm <dbl> 1125, 1125, 1125, 14, 120~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 16, 23, 19, 30, 2, 28, 23~
$ availability_60 <dbl> 46, 53, 42, 60, 6, 58, 53~
$ availability_90 <dbl> 76, 83, 67, 90, 36, 88, 8~
$ availability_365 <dbl> 256, 358, 337, 365, 311, ~
$ calendar_last_scraped <date> 2021-09-25, 2021-09-25, ~
$ number_of_reviews <dbl> 17, 2, 3, 0, 105, 5, 62, ~
$ number_of_reviews_ltm <dbl> 1, 0, 0, 0, 0, 1, 2, 0, 0~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2014-08-26, 2016-04-25, ~
$ last_review <date> 2017-06-30, 2018-10-28, ~
$ review_scores_rating <dbl> 4.44, 4.00, 5.00, NA, 4.8~
$ review_scores_accuracy <dbl> 4.63, 5.00, 5.00, NA, 4.8~
$ review_scores_cleanliness <dbl> 4.69, 5.00, 5.00, NA, 4.8~
$ review_scores_checkin <dbl> 4.56, 5.00, 5.00, NA, 4.9~
$ review_scores_communication <dbl> 4.75, 5.00, 4.00, NA, 4.9~
$ review_scores_location <dbl> 4.00, 5.00, 5.00, NA, 4.8~
$ review_scores_value <dbl> 4.44, 5.00, 4.00, NA, 4.7~
$ license <lgl> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable <lgl> FALSE, FALSE, TRUE, FALSE~
$ calculated_host_listings_count <dbl> 2, 2, 2, 1, 1, 15, 15, 15~
$ calculated_host_listings_count_entire_homes <dbl> 2, 2, 0, 1, 1, 15, 15, 15~
$ calculated_host_listings_count_private_rooms <dbl> 0, 0, 1, 0, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 0.20, 0.03, 0.10, NA, 0.9~
skim(listings, where(is.numeric))

| Name | listings |
| Number of rows | 5442 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 3.131539e+07 | 15866077.88 | 2.352000e+03 | 1.824171e+07 | 3.533347e+07 | 4.511619e+07 | 5.242511e+07 | ▃▃▃▆▇ |
| scrape_id | 0 | 1.00 | 2.021092e+13 | 0.00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.100261e+08 | 123833524.03 | 2.582000e+03 | 1.733301e+07 | 4.637172e+07 | 1.759884e+08 | 4.236817e+08 | ▇▂▁▁▁ |
| host_listings_count | 2 | 1.00 | 9.640000e+00 | 39.82 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 2.044000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 2 | 1.00 | 9.640000e+00 | 39.82 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 2.044000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.084000e+01 | 0.02 | 5.077000e+01 | 5.083000e+01 | 5.084000e+01 | 5.085000e+01 | 5.090000e+01 | ▁▃▇▂▁ |
| longitude | 0 | 1.00 | 4.360000e+00 | 0.03 | 4.260000e+00 | 4.340000e+00 | 4.360000e+00 | 4.380000e+00 | 4.480000e+00 | ▁▃▇▂▁ |
| accommodates | 0 | 1.00 | 3.010000e+00 | 1.77 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▃▁▁▁ |
| bedrooms | 630 | 0.88 | 1.400000e+00 | 1.05 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+01 | ▇▁▁▁▁ |
| beds | 83 | 0.98 | 1.710000e+00 | 1.26 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 1.029000e+01 | 36.19 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 2.339130e+03 | 120486.31 | 1.000000e+00 | 9.000000e+01 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| minimum_minimum_nights | 1 | 1.00 | 9.910000e+00 | 35.85 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 1 | 1.00 | 1.050000e+01 | 36.07 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 1 | 1.00 | 2.458030e+03 | 120495.62 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| maximum_maximum_nights | 1 | 1.00 | 2.476170e+03 | 120495.33 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 1 | 1.00 | 1.027000e+01 | 35.98 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.100000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 1 | 1.00 | 2.472100e+03 | 120495.39 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 9.090000e+00 | 10.77 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.900000e+01 | 3.000000e+01 | ▇▂▁▂▂ |
| availability_60 | 0 | 1.00 | 2.300000e+01 | 22.48 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | 4.500000e+01 | 6.000000e+01 | ▇▂▂▂▃ |
| availability_90 | 0 | 1.00 | 3.926000e+01 | 34.54 | 0.000000e+00 | 0.000000e+00 | 4.100000e+01 | 7.400000e+01 | 9.000000e+01 | ▇▂▂▃▅ |
| availability_365 | 0 | 1.00 | 1.665200e+02 | 134.04 | 0.000000e+00 | 3.500000e+01 | 1.480000e+02 | 3.060000e+02 | 3.650000e+02 | ▇▃▃▂▆ |
| number_of_reviews | 0 | 1.00 | 3.537000e+01 | 69.70 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 3.500000e+01 | 7.820000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 5.140000e+00 | 11.57 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 5.000000e+00 | 1.670000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 7.600000e-01 | 1.64 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 914 | 0.83 | 4.590000e+00 | 0.65 | 0.000000e+00 | 4.500000e+00 | 4.750000e+00 | 4.920000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 960 | 0.82 | 4.720000e+00 | 0.45 | 0.000000e+00 | 4.670000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 960 | 0.82 | 4.610000e+00 | 0.51 | 0.000000e+00 | 4.500000e+00 | 4.750000e+00 | 4.940000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 960 | 0.82 | 4.790000e+00 | 0.39 | 0.000000e+00 | 4.750000e+00 | 4.900000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 960 | 0.82 | 4.770000e+00 | 0.43 | 0.000000e+00 | 4.740000e+00 | 4.900000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 960 | 0.82 | 4.730000e+00 | 0.38 | 0.000000e+00 | 4.640000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 960 | 0.82 | 4.600000e+00 | 0.46 | 0.000000e+00 | 4.500000e+00 | 4.700000e+00 | 4.860000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 7.280000e+00 | 15.59 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 9.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 5.650000e+00 | 13.72 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 7.800000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.560000e+00 | 4.59 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.16 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 914 | 0.83 | 1.370000e+00 | 1.67 | 1.000000e-02 | 2.700000e-01 | 7.700000e-01 | 1.840000e+00 | 1.234000e+01 | ▇▁▁▁▁ |
skim(listings, where(is.factor))

| Name | listings |
| Number of rows | 5442 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| character | 23 |
| Date | 5 |
| logical | 9 |
| numeric | 37 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| listing_url | 0 | 1.00 | 33 | 37 | 0 | 5442 | 0 |
| name | 0 | 1.00 | 1 | 242 | 0 | 5333 | 0 |
| description | 229 | 0.96 | 1 | 1000 | 0 | 4945 | 0 |
| neighborhood_overview | 2230 | 0.59 | 1 | 1000 | 0 | 2629 | 0 |
| picture_url | 0 | 1.00 | 61 | 126 | 0 | 5311 | 0 |
| host_url | 0 | 1.00 | 38 | 43 | 0 | 3426 | 0 |
| host_name | 2 | 1.00 | 1 | 31 | 0 | 1902 | 0 |
| host_location | 16 | 1.00 | 2 | 70 | 0 | 405 | 0 |
| host_about | 2656 | 0.51 | 1 | 3655 | 0 | 1535 | 5 |
| host_response_time | 2 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 2 | 1.00 | 2 | 4 | 0 | 55 | 0 |
| host_acceptance_rate | 2 | 1.00 | 2 | 4 | 0 | 83 | 0 |
| host_thumbnail_url | 2 | 1.00 | 55 | 106 | 0 | 3390 | 0 |
| host_picture_url | 2 | 1.00 | 57 | 109 | 0 | 3390 | 0 |
| host_neighbourhood | 2028 | 0.63 | 5 | 29 | 0 | 88 | 0 |
| host_verifications | 0 | 1.00 | 4 | 141 | 0 | 188 | 0 |
| neighbourhood | 2230 | 0.59 | 7 | 63 | 0 | 126 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 21 | 0 | 19 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 45 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
| bathrooms_text | 12 | 1.00 | 6 | 17 | 0 | 27 | 0 |
| amenities | 0 | 1.00 | 2 | 1666 | 0 | 5025 | 0 |
| price | 0 | 1.00 | 5 | 9 | 0 | 290 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-24 | 2021-09-25 | 2021-09-25 | 2 |
| host_since | 2 | 1.00 | 2008-08-28 | 2021-09-19 | 2015-10-19 | 2100 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-24 | 2021-09-25 | 2021-09-25 | 2 |
| first_review | 914 | 0.83 | 2011-06-06 | 2021-09-23 | 2019-05-27 | 1778 |
| last_review | 914 | 0.83 | 2010-11-06 | 2021-09-24 | 2020-03-11 | 1219 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 2 | 1 | 0.20 | FAL: 4366, TRU: 1074 |
| host_has_profile_pic | 2 | 1 | 0.99 | TRU: 5394, FAL: 46 |
| host_identity_verified | 2 | 1 | 0.87 | TRU: 4741, FAL: 699 |
| neighbourhood_group_cleansed | 5442 | 0 | NaN | : |
| bathrooms | 5442 | 0 | NaN | : |
| calendar_updated | 5442 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 5382, FAL: 60 |
| license | 5442 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.37 | FAL: 3446, TRU: 1996 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| id | 0 | 1.00 | 3.131539e+07 | 15866077.88 | 2.352000e+03 | 1.824171e+07 | 3.533347e+07 | 4.511619e+07 | 5.242511e+07 | ▃▃▃▆▇ |
| scrape_id | 0 | 1.00 | 2.021092e+13 | 0.00 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | 2.021092e+13 | ▁▁▇▁▁ |
| host_id | 0 | 1.00 | 1.100261e+08 | 123833524.03 | 2.582000e+03 | 1.733301e+07 | 4.637172e+07 | 1.759884e+08 | 4.236817e+08 | ▇▂▁▁▁ |
| host_listings_count | 2 | 1.00 | 9.640000e+00 | 39.82 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 2.044000e+03 | ▇▁▁▁▁ |
| host_total_listings_count | 2 | 1.00 | 9.640000e+00 | 39.82 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 2.044000e+03 | ▇▁▁▁▁ |
| latitude | 0 | 1.00 | 5.084000e+01 | 0.02 | 5.077000e+01 | 5.083000e+01 | 5.084000e+01 | 5.085000e+01 | 5.090000e+01 | ▁▃▇▂▁ |
| longitude | 0 | 1.00 | 4.360000e+00 | 0.03 | 4.260000e+00 | 4.340000e+00 | 4.360000e+00 | 4.380000e+00 | 4.480000e+00 | ▁▃▇▂▁ |
| accommodates | 0 | 1.00 | 3.010000e+00 | 1.77 | 0.000000e+00 | 2.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.600000e+01 | ▇▃▁▁▁ |
| bedrooms | 630 | 0.88 | 1.400000e+00 | 1.05 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+01 | ▇▁▁▁▁ |
| beds | 83 | 0.98 | 1.710000e+00 | 1.26 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 1.600000e+01 | ▇▁▁▁▁ |
| minimum_nights | 0 | 1.00 | 1.029000e+01 | 36.19 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights | 0 | 1.00 | 2.339130e+03 | 120486.31 | 1.000000e+00 | 9.000000e+01 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| minimum_minimum_nights | 1 | 1.00 | 9.910000e+00 | 35.85 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_minimum_nights | 1 | 1.00 | 1.050000e+01 | 36.07 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 5.000000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| minimum_maximum_nights | 1 | 1.00 | 2.458030e+03 | 120495.62 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| maximum_maximum_nights | 1 | 1.00 | 2.476170e+03 | 120495.33 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| minimum_nights_avg_ntm | 1 | 1.00 | 1.027000e+01 | 35.98 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 4.100000e+00 | 1.125000e+03 | ▇▁▁▁▁ |
| maximum_nights_avg_ntm | 1 | 1.00 | 2.472100e+03 | 120495.39 | 1.000000e+00 | 3.650000e+02 | 1.125000e+03 | 1.125000e+03 | 8.888888e+06 | ▇▁▁▁▁ |
| availability_30 | 0 | 1.00 | 9.090000e+00 | 10.77 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | 1.900000e+01 | 3.000000e+01 | ▇▂▁▂▂ |
| availability_60 | 0 | 1.00 | 2.300000e+01 | 22.48 | 0.000000e+00 | 0.000000e+00 | 2.100000e+01 | 4.500000e+01 | 6.000000e+01 | ▇▂▂▂▃ |
| availability_90 | 0 | 1.00 | 3.926000e+01 | 34.54 | 0.000000e+00 | 0.000000e+00 | 4.100000e+01 | 7.400000e+01 | 9.000000e+01 | ▇▂▂▃▅ |
| availability_365 | 0 | 1.00 | 1.665200e+02 | 134.04 | 0.000000e+00 | 3.500000e+01 | 1.480000e+02 | 3.060000e+02 | 3.650000e+02 | ▇▃▃▂▆ |
| number_of_reviews | 0 | 1.00 | 3.537000e+01 | 69.70 | 0.000000e+00 | 2.000000e+00 | 8.000000e+00 | 3.500000e+01 | 7.820000e+02 | ▇▁▁▁▁ |
| number_of_reviews_ltm | 0 | 1.00 | 5.140000e+00 | 11.57 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 5.000000e+00 | 1.670000e+02 | ▇▁▁▁▁ |
| number_of_reviews_l30d | 0 | 1.00 | 7.600000e-01 | 1.64 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 2.000000e+01 | ▇▁▁▁▁ |
| review_scores_rating | 914 | 0.83 | 4.590000e+00 | 0.65 | 0.000000e+00 | 4.500000e+00 | 4.750000e+00 | 4.920000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_accuracy | 960 | 0.82 | 4.720000e+00 | 0.45 | 0.000000e+00 | 4.670000e+00 | 4.850000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_cleanliness | 960 | 0.82 | 4.610000e+00 | 0.51 | 0.000000e+00 | 4.500000e+00 | 4.750000e+00 | 4.940000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_checkin | 960 | 0.82 | 4.790000e+00 | 0.39 | 0.000000e+00 | 4.750000e+00 | 4.900000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_communication | 960 | 0.82 | 4.770000e+00 | 0.43 | 0.000000e+00 | 4.740000e+00 | 4.900000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_location | 960 | 0.82 | 4.730000e+00 | 0.38 | 0.000000e+00 | 4.640000e+00 | 4.830000e+00 | 5.000000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| review_scores_value | 960 | 0.82 | 4.600000e+00 | 0.46 | 0.000000e+00 | 4.500000e+00 | 4.700000e+00 | 4.860000e+00 | 5.000000e+00 | ▁▁▁▁▇ |
| calculated_host_listings_count | 0 | 1.00 | 7.280000e+00 | 15.59 | 1.000000e+00 | 1.000000e+00 | 1.000000e+00 | 4.000000e+00 | 9.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_entire_homes | 0 | 1.00 | 5.650000e+00 | 13.72 | 0.000000e+00 | 1.000000e+00 | 1.000000e+00 | 2.000000e+00 | 7.800000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_private_rooms | 0 | 1.00 | 1.560000e+00 | 4.59 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 1.000000e+00 | 4.100000e+01 | ▇▁▁▁▁ |
| calculated_host_listings_count_shared_rooms | 0 | 1.00 | 1.000000e-02 | 0.16 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 0.000000e+00 | 3.000000e+00 | ▇▁▁▁▁ |
| reviews_per_month | 914 | 0.83 | 1.370000e+00 | 1.67 | 1.000000e-02 | 2.700000e-01 | 7.700000e-01 | 1.840000e+00 | 1.234000e+01 | ▇▁▁▁▁ |
skim(listings, where(is.Date))

| Name | listings |
| Number of rows | 5442 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| Date | 5 |
| ________________________ | |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| last_scraped | 0 | 1.00 | 2021-09-24 | 2021-09-25 | 2021-09-25 | 2 |
| host_since | 2 | 1.00 | 2008-08-28 | 2021-09-19 | 2015-10-19 | 2100 |
| calendar_last_scraped | 0 | 1.00 | 2021-09-24 | 2021-09-25 | 2021-09-25 | 2 |
| first_review | 914 | 0.83 | 2011-06-06 | 2021-09-23 | 2019-05-27 | 1778 |
| last_review | 914 | 0.83 | 2010-11-06 | 2021-09-24 | 2020-03-11 | 1219 |
skim(listings, where(is.logical))

| Name | listings |
| Number of rows | 5442 |
| Number of columns | 74 |
| _______________________ | |
| Column type frequency: | |
| logical | 9 |
| ________________________ | |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 2 | 1 | 0.20 | FAL: 4366, TRU: 1074 |
| host_has_profile_pic | 2 | 1 | 0.99 | TRU: 5394, FAL: 46 |
| host_identity_verified | 2 | 1 | 0.87 | TRU: 4741, FAL: 699 |
| neighbourhood_group_cleansed | 5442 | 0 | NaN | : |
| bathrooms | 5442 | 0 | NaN | : |
| calendar_updated | 5442 | 0 | NaN | : |
| has_availability | 0 | 1 | 0.99 | TRU: 5382, FAL: 60 |
| license | 5442 | 0 | NaN | : |
| instant_bookable | 0 | 1 | 0.37 | FAL: 3446, TRU: 1996 |
There are 74 variables and 5,442 rows.
There are 37 numeric variables. They are:
- id
- scrape_id
- host_id
- host_listings_count
- host_total_listings_count
- latitude
- longitude
- accommodates
- bedrooms
- beds
- minimum_nights
- maximum_nights
- minimum_minimum_nights
- maximum_minimum_nights
- minimum_maximum_nights
- maximum_maximum_nights
- minimum_nights_avg_ntm
- maximum_nights_avg_ntm
- availability_30
- availability_60
- availability_90
- availability_365
- number_of_reviews
- number_of_reviews_ltm
- number_of_reviews_l30d
- review_scores_rating
- review_scores_accuracy
- review_scores_cleanliness
- review_scores_checkin
- review_scores_communication
- review_scores_location
- review_scores_value
- calculated_host_listings_count
- calculated_host_listings_count_entire_homes
- calculated_host_listings_count_private_rooms
- calculated_host_listings_count_shared_rooms
- reviews_per_month
Categorical variables:
- host_verifications
- host_has_profile_pic
- host_identity_verified
- neighbourhood
- neighbourhood_cleansed
- property_type
- room_type
- has_availability
- instant_bookable
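Lists like the ones above can also be produced programmatically instead of being typed by hand. A minimal base R sketch on a tiny stand-in data frame (the `toy` object and its four columns are illustrative, taken from the first rows of the glimpse output):

```r
# Small stand-in for `listings` with one column of each kind
toy <- data.frame(
  id                = c(2352, 2354),
  price             = c("$90.00", "$74.00"),
  host_is_superhost = c(FALSE, TRUE),
  last_scraped      = as.Date(c("2021-09-25", "2021-09-25")),
  stringsAsFactors  = FALSE
)

# Column names grouped by storage type
numeric_vars   <- names(toy)[sapply(toy, is.numeric)]
character_vars <- names(toy)[sapply(toy, is.character)]
logical_vars   <- names(toy)[sapply(toy, is.logical)]
factor_vars    <- names(toy)[sapply(toy, is.factor)]

numeric_vars
length(factor_vars)  # 0 here: text columns are stored as character
```

Note that price shows up as a character column at this stage, which is why it needs converting before any numeric summary.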
listings <- listings %>%
  mutate(price = parse_number(price),
         bathrooms = parse_number(bathrooms_text))
typeof(listings$price)
[1] "double"
typeof(listings$bathrooms)
[1] "double"
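To make the conversion concrete, the sketch below mimics the idea with a small base R helper: pull the first number out of each string and return NA when there is none. `extract_number` is a hypothetical stand-in written for this illustration, not readr's implementation, and the example strings other than "$90.00", "$74.00" and "1 bath" are made up.

```r
# Hypothetical stand-in for number parsing: first number in each string
extract_number <- function(x) {
  hit <- regexpr("[0-9][0-9,]*(\\.[0-9]+)?", x)
  ok  <- !is.na(hit) & hit > 0
  out <- rep(NA_real_, length(x))
  # regmatches() returns only the matched pieces, in order
  out[ok] <- as.numeric(gsub(",", "", regmatches(x, hit)))
  out
}

price_text     <- c("$90.00", "$74.00", "$1,250.00")
bathrooms_text <- c("1 bath", "1.5 shared baths", "Half-bath")

price_num     <- extract_number(price_text)
bathrooms_num <- extract_number(bathrooms_text)

price_num
bathrooms_num  # text with no digits, such as "Half-bath", becomes NA
```

Entries without any digits end up missing after conversion, which may be why bathrooms has more missing values in the summary table below than bathrooms_text has in the skim output.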
# select some important variables to calculate summary statistics
important_var <- c('host_listings_count', 'host_total_listings_count', 'accommodates', 'bathrooms','bedrooms', 'beds', 'price', 'minimum_nights', 'maximum_nights', 'minimum_minimum_nights', 'maximum_minimum_nights', 'minimum_maximum_nights', 'minimum_nights_avg_ntm', 'maximum_nights_avg_ntm', 'availability_30', 'availability_60', 'availability_90', 'availability_365', 'number_of_reviews', 'number_of_reviews_ltm', 'number_of_reviews_l30d', 'review_scores_rating', 'review_scores_accuracy')
# reshape to long format so each variable can be summarised as a group
listings[, important_var] %>%
  pivot_longer(cols = everything(), names_to = 'variable', values_to = 'value') %>%
  group_by(variable) %>%
  summarise(favstats(value))

| variable | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| accommodates | 0 | 2 | 2 | 4 | 16 | 3.01 | 1.77 | 5442 | 0 |
| availability_30 | 0 | 0 | 3 | 19 | 30 | 9.09 | 10.8 | 5442 | 0 |
| availability_365 | 0 | 35 | 148 | 306 | 365 | 167 | 134 | 5442 | 0 |
| availability_60 | 0 | 0 | 21 | 45 | 60 | 23 | 22.5 | 5442 | 0 |
| availability_90 | 0 | 0 | 41 | 74 | 90 | 39.3 | 34.5 | 5442 | 0 |
| bathrooms | 0 | 1 | 1 | 1 | 19.5 | 1.19 | 0.564 | 5411 | 31 |
| bedrooms | 1 | 1 | 1 | 2 | 40 | 1.4 | 1.05 | 4812 | 630 |
| beds | 0 | 1 | 1 | 2 | 16 | 1.71 | 1.26 | 5359 | 83 |
| host_listings_count | 0 | 1 | 1 | 4 | 2.04e+03 | 9.64 | 39.8 | 5440 | 2 |
| host_total_listings_count | 0 | 1 | 1 | 4 | 2.04e+03 | 9.64 | 39.8 | 5440 | 2 |
| maximum_minimum_nights | 1 | 1 | 2 | 5 | 1.12e+03 | 10.5 | 36.1 | 5441 | 1 |
| maximum_nights | 1 | 90 | 1.12e+03 | 1.12e+03 | 8.89e+06 | 2.34e+03 | 1.2e+05 | 5442 | 0 |
| maximum_nights_avg_ntm | 1 | 365 | 1.12e+03 | 1.12e+03 | 8.89e+06 | 2.47e+03 | 1.2e+05 | 5441 | 1 |
| minimum_maximum_nights | 1 | 365 | 1.12e+03 | 1.12e+03 | 8.89e+06 | 2.46e+03 | 1.2e+05 | 5441 | 1 |
| minimum_minimum_nights | 1 | 1 | 2 | 4 | 1.12e+03 | 9.91 | 35.9 | 5441 | 1 |
| minimum_nights | 1 | 1 | 2 | 4 | 1.12e+03 | 10.3 | 36.2 | 5442 | 0 |
| minimum_nights_avg_ntm | 1 | 1 | 2 | 4.1 | 1.12e+03 | 10.3 | 36 | 5441 | 1 |
| number_of_reviews | 0 | 2 | 8 | 35 | 782 | 35.4 | 69.7 | 5442 | 0 |
| number_of_reviews_l30d | 0 | 0 | 0 | 1 | 20 | 0.761 | 1.64 | 5442 | 0 |
| number_of_reviews_ltm | 0 | 0 | 1 | 5 | 167 | 5.14 | 11.6 | 5442 | 0 |
| price | 0 | 46 | 65 | 92 | 5e+03 | 87.1 | 132 | 5442 | 0 |
| review_scores_accuracy | 0 | 4.67 | 4.85 | 5 | 5 | 4.72 | 0.452 | 4482 | 960 |
| review_scores_rating | 0 | 4.5 | 4.75 | 4.92 | 5 | 4.59 | 0.65 | 4528 | 914 |
# Variables of interest
new_listings <- listings %>%
select(host_since, host_location, host_response_time, host_response_rate, host_is_superhost, host_neighbourhood, host_listings_count, host_total_listings_count, host_has_profile_pic, host_identity_verified, neighbourhood_cleansed, latitude, longitude, property_type, room_type, accommodates, bathrooms, bedrooms, beds, price, minimum_nights, maximum_nights, minimum_nights_avg_ntm, maximum_nights_avg_ntm, has_availability, number_of_reviews, review_scores_rating, instant_bookable, availability_30, reviews_per_month)
skim(new_listings, where(is.numeric))

| Name | new_listings |
|---|---|
| Number of rows | 5442 |
| Number of columns | 30 |
| Column type frequency: | |
| numeric | 17 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_listings_count | 2 | 1.00 | 9.64 | 39.82 | 0.00 | 1.00 | 1.00 | 4.00 | 2044.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 2 | 1.00 | 9.64 | 39.82 | 0.00 | 1.00 | 1.00 | 4.00 | 2044.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 50.84 | 0.02 | 50.77 | 50.83 | 50.84 | 50.85 | 50.90 | <U+2581><U+2583><U+2587><U+2582><U+2581> |
| longitude | 0 | 1.00 | 4.36 | 0.03 | 4.26 | 4.34 | 4.36 | 4.38 | 4.48 | <U+2581><U+2583><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.01 | 1.77 | 0.00 | 2.00 | 2.00 | 4.00 | 16.00 | <U+2587><U+2583><U+2581><U+2581><U+2581> |
| bathrooms | 31 | 0.99 | 1.19 | 0.56 | 0.00 | 1.00 | 1.00 | 1.00 | 19.50 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 630 | 0.88 | 1.40 | 1.05 | 1.00 | 1.00 | 1.00 | 2.00 | 40.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 83 | 0.98 | 1.71 | 1.26 | 0.00 | 1.00 | 1.00 | 2.00 | 16.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1.00 | 87.13 | 132.37 | 0.00 | 46.00 | 65.00 | 92.00 | 5000.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 10.29 | 36.19 | 1.00 | 1.00 | 2.00 | 4.00 | 1125.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 2339.13 | 120486.31 | 1.00 | 90.00 | 1125.00 | 1125.00 | 8888888.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 1 | 1.00 | 10.27 | 35.98 | 1.00 | 1.00 | 2.00 | 4.10 | 1125.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 1 | 1.00 | 2472.10 | 120495.39 | 1.00 | 365.00 | 1125.00 | 1125.00 | 8888888.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 35.37 | 69.70 | 0.00 | 2.00 | 8.00 | 35.00 | 782.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 914 | 0.83 | 4.59 | 0.65 | 0.00 | 4.50 | 4.75 | 4.92 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| availability_30 | 0 | 1.00 | 9.09 | 10.77 | 0.00 | 0.00 | 3.00 | 19.00 | 30.00 | <U+2587><U+2582><U+2581><U+2582><U+2582> |
| reviews_per_month | 914 | 0.83 | 1.37 | 1.67 | 0.01 | 0.27 | 0.77 | 1.84 | 12.34 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
skim(new_listings, where(is.factor))

| Name | new_listings |
|---|---|
| Number of rows | 5442 |
| Number of columns | 30 |
| Column type frequency: | |
| character | 7 |
| Date | 1 |
| logical | 5 |
| numeric | 17 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| host_location | 16 | 1.00 | 2 | 70 | 0 | 405 | 0 |
| host_response_time | 2 | 1.00 | 3 | 18 | 0 | 5 | 0 |
| host_response_rate | 2 | 1.00 | 2 | 4 | 0 | 55 | 0 |
| host_neighbourhood | 2028 | 0.63 | 5 | 29 | 0 | 88 | 0 |
| neighbourhood_cleansed | 0 | 1.00 | 5 | 21 | 0 | 19 | 0 |
| property_type | 0 | 1.00 | 4 | 35 | 0 | 45 | 0 |
| room_type | 0 | 1.00 | 10 | 15 | 0 | 4 | 0 |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 2 | 1 | 2008-08-28 | 2021-09-19 | 2015-10-19 | 2100 |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 2 | 1 | 0.20 | FAL: 4366, TRU: 1074 |
| host_has_profile_pic | 2 | 1 | 0.99 | TRU: 5394, FAL: 46 |
| host_identity_verified | 2 | 1 | 0.87 | TRU: 4741, FAL: 699 |
| has_availability | 0 | 1 | 0.99 | TRU: 5382, FAL: 60 |
| instant_bookable | 0 | 1 | 0.37 | FAL: 3446, TRU: 1996 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| host_listings_count | 2 | 1.00 | 9.64 | 39.82 | 0.00 | 1.00 | 1.00 | 4.00 | 2044.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| host_total_listings_count | 2 | 1.00 | 9.64 | 39.82 | 0.00 | 1.00 | 1.00 | 4.00 | 2044.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| latitude | 0 | 1.00 | 50.84 | 0.02 | 50.77 | 50.83 | 50.84 | 50.85 | 50.90 | <U+2581><U+2583><U+2587><U+2582><U+2581> |
| longitude | 0 | 1.00 | 4.36 | 0.03 | 4.26 | 4.34 | 4.36 | 4.38 | 4.48 | <U+2581><U+2583><U+2587><U+2582><U+2581> |
| accommodates | 0 | 1.00 | 3.01 | 1.77 | 0.00 | 2.00 | 2.00 | 4.00 | 16.00 | <U+2587><U+2583><U+2581><U+2581><U+2581> |
| bathrooms | 31 | 0.99 | 1.19 | 0.56 | 0.00 | 1.00 | 1.00 | 1.00 | 19.50 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| bedrooms | 630 | 0.88 | 1.40 | 1.05 | 1.00 | 1.00 | 1.00 | 2.00 | 40.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| beds | 83 | 0.98 | 1.71 | 1.26 | 0.00 | 1.00 | 1.00 | 2.00 | 16.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| price | 0 | 1.00 | 87.13 | 132.37 | 0.00 | 46.00 | 65.00 | 92.00 | 5000.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights | 0 | 1.00 | 10.29 | 36.19 | 1.00 | 1.00 | 2.00 | 4.00 | 1125.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights | 0 | 1.00 | 2339.13 | 120486.31 | 1.00 | 90.00 | 1125.00 | 1125.00 | 8888888.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| minimum_nights_avg_ntm | 1 | 1.00 | 10.27 | 35.98 | 1.00 | 1.00 | 2.00 | 4.10 | 1125.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| maximum_nights_avg_ntm | 1 | 1.00 | 2472.10 | 120495.39 | 1.00 | 365.00 | 1125.00 | 1125.00 | 8888888.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| number_of_reviews | 0 | 1.00 | 35.37 | 69.70 | 0.00 | 2.00 | 8.00 | 35.00 | 782.00 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
| review_scores_rating | 914 | 0.83 | 4.59 | 0.65 | 0.00 | 4.50 | 4.75 | 4.92 | 5.00 | <U+2581><U+2581><U+2581><U+2581><U+2587> |
| availability_30 | 0 | 1.00 | 9.09 | 10.77 | 0.00 | 0.00 | 3.00 | 19.00 | 30.00 | <U+2587><U+2582><U+2581><U+2582><U+2582> |
| reviews_per_month | 914 | 0.83 | 1.37 | 1.67 | 0.01 | 0.27 | 0.77 | 1.84 | 12.34 | <U+2587><U+2581><U+2581><U+2581><U+2581> |
skim(new_listings, where(is.Date))

| Name | new_listings |
|---|---|
| Number of rows | 5442 |
| Number of columns | 30 |
| Column type frequency: | |
| Date | 1 |
| Group variables | None |
Variable type: Date
| skim_variable | n_missing | complete_rate | min | max | median | n_unique |
|---|---|---|---|---|---|---|
| host_since | 2 | 1 | 2008-08-28 | 2021-09-19 | 2015-10-19 | 2100 |
skim(new_listings, where(is.logical))

| Name | new_listings |
|---|---|
| Number of rows | 5442 |
| Number of columns | 30 |
| Column type frequency: | |
| logical | 5 |
| Group variables | None |
Variable type: logical
| skim_variable | n_missing | complete_rate | mean | count |
|---|---|---|---|---|
| host_is_superhost | 2 | 1 | 0.20 | FAL: 4366, TRU: 1074 |
| host_has_profile_pic | 2 | 1 | 0.99 | TRU: 5394, FAL: 46 |
| host_identity_verified | 2 | 1 | 0.87 | TRU: 4741, FAL: 699 |
| has_availability | 0 | 1 | 0.99 | TRU: 5382, FAL: 60 |
| instant_bookable | 0 | 1 | 0.37 | FAL: 3446, TRU: 1996 |
Based on the results of skim, the missing values are as follows: 229 in description; 2230 in each of neighborhood_overview and neighbourhood; 2656 in host_about; 2028 in host_neighbourhood; 16 in host_location; 12 in bathrooms_text; 914 in each of first_review and last_review; and 2 in each of host_name, host_since, host_response_time, host_response_rate, host_acceptance_rate, host_thumbnail_url, host_picture_url, host_is_superhost, host_has_profile_pic and host_identity_verified. Four columns are missing in all 5442 rows: neighbourhood_group_cleansed, bathrooms, calendar_updated and license.
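The per-column counts of missing values can also be tabulated directly rather than read off the skim output. The following is a minimal base R sketch; the data frame `toy_listings` and its values are hypothetical and only stand in for `listings`.

```r
# Hypothetical toy data standing in for `listings` (values are made up).
toy_listings <- data.frame(
  host_location = c("Brussels", NA, "Ghent", NA),
  bedrooms      = c(1, NA, 2, 1),
  price         = c(65, 46, 92, 120)
)

# Count missing values per column, then keep only the columns that have
# at least one missing value, sorted from most to least affected.
missing_counts <- colSums(is.na(toy_listings))
missing_counts <- sort(missing_counts[missing_counts > 0], decreasing = TRUE)
print(missing_counts)
```

Applied to the real data, the same two lines would presumably give one number per affected column, which is easier to check than a long sentence.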
new_listings %>%
count(property_type) %>%
arrange(desc(`n`)) %>%
pivot_wider(names_from = property_type, values_from = n) %>%
mutate(total = rowSums(.)) %>%
pivot_longer(cols = 1:45, names_to = 'property_type', values_to = 'count' ) %>%
mutate(proportion = count / total)

| total | property_type | count | proportion |
|---|---|---|---|
| 5.44e+03 | Entire rental unit | 2866 | 0.527 |
| 5.44e+03 | Private room in rental unit | 716 | 0.132 |
| 5.44e+03 | Entire condominium (condo) | 298 | 0.0548 |
| 5.44e+03 | Private room in residential home | 283 | 0.052 |
| 5.44e+03 | Entire serviced apartment | 219 | 0.0402 |
| 5.44e+03 | Private room in townhouse | 165 | 0.0303 |
| 5.44e+03 | Entire residential home | 159 | 0.0292 |
| 5.44e+03 | Entire loft | 144 | 0.0265 |
| 5.44e+03 | Private room in bed and breakfast | 106 | 0.0195 |
| 5.44e+03 | Private room in condominium (condo) | 81 | 0.0149 |
| 5.44e+03 | Entire townhouse | 65 | 0.0119 |
| 5.44e+03 | Room in hotel | 62 | 0.0114 |
| 5.44e+03 | Private room in loft | 39 | 0.00717 |
| 5.44e+03 | Room in boutique hotel | 30 | 0.00551 |
| 5.44e+03 | Private room in guesthouse | 27 | 0.00496 |
| 5.44e+03 | Shared room in rental unit | 26 | 0.00478 |
| 5.44e+03 | Room in bed and breakfast | 23 | 0.00423 |
| 5.44e+03 | Entire guesthouse | 18 | 0.00331 |
| 5.44e+03 | Entire guest suite | 17 | 0.00312 |
| 5.44e+03 | Private room in guest suite | 15 | 0.00276 |
| 5.44e+03 | Room in aparthotel | 15 | 0.00276 |
| 5.44e+03 | Private room in villa | 11 | 0.00202 |
| 5.44e+03 | Private room in casa particular | 7 | 0.00129 |
| 5.44e+03 | Entire villa | 6 | 0.0011 |
| 5.44e+03 | Private room | 6 | 0.0011 |
| 5.44e+03 | Private room in nature lodge | 5 | 0.000919 |
| 5.44e+03 | Tiny house | 5 | 0.000919 |
| 5.44e+03 | Private room in serviced apartment | 4 | 0.000735 |
| 5.44e+03 | Room in serviced apartment | 4 | 0.000735 |
| 5.44e+03 | Shared room in condominium (condo) | 3 | 0.000551 |
| 5.44e+03 | Entire bed and breakfast | 2 | 0.000368 |
| 5.44e+03 | Private room in tiny house | 2 | 0.000368 |
| 5.44e+03 | Barn | 1 | 0.000184 |
| 5.44e+03 | Entire cottage | 1 | 0.000184 |
| 5.44e+03 | Entire place | 1 | 0.000184 |
| 5.44e+03 | Floor | 1 | 0.000184 |
| 5.44e+03 | Private room in barn | 1 | 0.000184 |
| 5.44e+03 | Private room in castle | 1 | 0.000184 |
| 5.44e+03 | Private room in dome house | 1 | 0.000184 |
| 5.44e+03 | Private room in farm stay | 1 | 0.000184 |
| 5.44e+03 | Private room in floor | 1 | 0.000184 |
| 5.44e+03 | Private room in hostel | 1 | 0.000184 |
| 5.44e+03 | Shared room | 1 | 0.000184 |
| 5.44e+03 | Shared room in residential home | 1 | 0.000184 |
| 5.44e+03 | Shared room in serviced apartment | 1 | 0.000184 |
The top four property types are ‘Entire rental unit’, ‘Private room in rental unit’, ‘Entire condominium (condo)’ and ‘Private room in residential home’, which account for 52.7%, 13.2%, 5.48% and 5.20% of listings respectively.
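As a quick check on these shares, the proportions can be recomputed from the counts in the table. This is a base R sketch, not the pipeline used in the report; the vector name `property_counts` and the pooled label "All other types" are introduced here for illustration only.

```r
# Counts copied from the table above; everything outside the top four
# is pooled into a single hypothetical "All other types" entry.
property_counts <- c(
  "Entire rental unit"               = 2866,
  "Private room in rental unit"      = 716,
  "Entire condominium (condo)"       = 298,
  "Private room in residential home" = 283,
  "All other types"                  = 5442 - (2866 + 716 + 298 + 283)
)

# prop.table() divides each count by the total of all counts.
property_share <- round(100 * prop.table(property_counts), 2)
print(property_share)
```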
new_listings <- new_listings %>%
mutate(prop_type_simplified = case_when(
property_type %in% c("Entire rental unit","Private room in rental unit", "Entire condominium (condo)","Private room in residential home") ~ property_type,
TRUE ~ "Other"
))
new_listings %>%
count(property_type, prop_type_simplified) %>%
arrange(desc(n))

| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 2866 |
| Private room in rental unit | Private room in rental unit | 716 |
| Entire condominium (condo) | Entire condominium (condo) | 298 |
| Private room in residential home | Private room in residential home | 283 |
| Entire serviced apartment | Other | 219 |
| Private room in townhouse | Other | 165 |
| Entire residential home | Other | 159 |
| Entire loft | Other | 144 |
| Private room in bed and breakfast | Other | 106 |
| Private room in condominium (condo) | Other | 81 |
| Entire townhouse | Other | 65 |
| Room in hotel | Other | 62 |
| Private room in loft | Other | 39 |
| Room in boutique hotel | Other | 30 |
| Private room in guesthouse | Other | 27 |
| Shared room in rental unit | Other | 26 |
| Room in bed and breakfast | Other | 23 |
| Entire guesthouse | Other | 18 |
| Entire guest suite | Other | 17 |
| Private room in guest suite | Other | 15 |
| Room in aparthotel | Other | 15 |
| Private room in villa | Other | 11 |
| Private room in casa particular | Other | 7 |
| Entire villa | Other | 6 |
| Private room | Other | 6 |
| Private room in nature lodge | Other | 5 |
| Tiny house | Other | 5 |
| Private room in serviced apartment | Other | 4 |
| Room in serviced apartment | Other | 4 |
| Shared room in condominium (condo) | Other | 3 |
| Entire bed and breakfast | Other | 2 |
| Private room in tiny house | Other | 2 |
| Barn | Other | 1 |
| Entire cottage | Other | 1 |
| Entire place | Other | 1 |
| Floor | Other | 1 |
| Private room in barn | Other | 1 |
| Private room in castle | Other | 1 |
| Private room in dome house | Other | 1 |
| Private room in farm stay | Other | 1 |
| Private room in floor | Other | 1 |
| Private room in hostel | Other | 1 |
| Shared room | Other | 1 |
| Shared room in residential home | Other | 1 |
| Shared room in serviced apartment | Other | 1 |
new_listings %>%
mutate(minimum_nights = as.factor(minimum_nights)) %>%
group_by(minimum_nights) %>%
count() %>%
arrange(desc(n))
# A tibble: 61 x 2
# Groups: minimum_nights [61]
minimum_nights n
<fct> <int>
1 1 1656
2 2 1528
3 3 661
4 5 275
5 4 257
6 7 233
7 90 131
8 30 105
9 6 76
10 14 72
# ... with 51 more rows
The most common value of minimum_nights is 1 night, followed by 2 and 3 nights.
Among the ten most common values, 30 and 90 nights stand out.
These unusual values correspond to roughly one month and one quarter, which suggests that some hosts intend to let their properties on a long-term basis.
new_listings <- new_listings %>%
filter(minimum_nights <= 4) # keep only listings with a minimum stay of at most 4 nights
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
addProviderTiles("OpenStreetMap.Mapnik") %>%
addCircleMarkers(lng = ~longitude,
lat = ~latitude,
radius = 1,
fillColor = "blue",
fillOpacity = 0.4,
popup = ~listing_url,
label = ~property_type)
In the next section, we use the visualisation tools we have learnt so far to answer some interesting questions we have come up with.
# Creating the data table that we will use to plot the top room frequency graph
top_room_type <- new_listings %>%
group_by(room_type) %>%
summarise(room_type_count = n())
# Creating the data table that we will use to plot the top average accommodating room type graph
average_number_accomodated_by_room_type <- new_listings %>%
group_by(room_type) %>%
summarise(average_accomodated = mean(accommodates))
room_type_bar_graph <- top_room_type %>%
ggplot(aes(x = room_type_count, y = fct_reorder(room_type, room_type_count))) +
geom_col(fill='yellow') +
theme_bw()+
labs(
title = "What type of listings (by room type) are most common in Brussels?",
subtitle = NULL,
x = "Count",
y = NULL )
room_type_bar_graph
average_number_accomodated_bar_graph <- average_number_accomodated_by_room_type %>%
ggplot(aes(x = average_accomodated, y = fct_reorder(room_type, average_accomodated))) +
geom_col(fill='blue') +
theme_bw()+
labs(
title = "On average, which type of room accommodates the most people?",
subtitle = NULL,
x = "Average Number Accommodated",
y = NULL )
average_number_accomodated_bar_graph
price_by_prop_type_histo <- ggplot(new_listings, aes(x = price))+
geom_boxplot(outlier.colour = "red", outlier.shape =8, outlier.size =4) +
facet_wrap(~room_type)+
theme_bw()+
xlim(0,300)+
labs(title = "AirBnB listings in Brussels' price distribution",
x = "Price",
y = "")
price_by_prop_type_histo
price_by_prop_type_density <- ggplot(new_listings, aes(x = price))+
geom_density(fill = 'orange')+
theme_bw()+
xlim(0,300)+
facet_wrap(~ room_type)+
labs(title = "AirBnB listings in Brussels' price distribution",
x = "Price",
y = "")
price_by_prop_type_density
The objective here was to identify how listing prices are distributed within each room type. All the distributions are right-skewed and, for the most part, multi-modal. Right skew implies that the mean price is greater than the median: a few listings are priced far above the typical price and pull the mean upwards. This is a result we would expect, particularly given the presence of luxury listings.
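The link between right skew and a mean above the median can be illustrated with a small base R sketch. The vector `toy_price` is hypothetical and is not drawn from the Brussels data; it simply contains one listing priced far above the rest.

```r
# Hypothetical nightly prices with a single very expensive listing.
toy_price <- c(40, 46, 55, 65, 80, 92, 450)

# The expensive listing pulls the mean up but leaves the median unchanged.
price_summary <- c(mean = mean(toy_price), median = median(toy_price))
print(price_summary)

# A mean well above the median is the signature of a right-skewed distribution.
price_summary[["mean"]] > price_summary[["median"]]
```

The same comparison holds in the summary statistics reported earlier, where price has a mean of 87.1 against a median of 65.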
In the following analysis, we use the GGpairs plot to qualitatively answer a set of questions based on relationships between two variables.
# GGpairs plot to answer it all
new_listings %>%
select(price, minimum_nights, maximum_nights, beds, host_identity_verified, review_scores_rating) %>%
ggpairs(aes(alpha=0.1))+
theme_bw()
We conjectured that as the minimum number of nights increases, the listing price would decrease, and this is in fact the case (bearing in mind that this is correlation, not causation). The negative correlation of -0.043, albeit very weak, makes sense, as a higher minimum number of nights is a restriction on customers that the host has to compensate for with a lower price.
We would expect the correlation between the number of beds and price to be positive, and the results show a weak positive correlation of 0.252. The logic is relatively straightforward: more beds imply a bigger house or unit, and customers would be expected to pay more for a bigger property. The correlation may not be stronger, however, because of confounding factors we have not accounted for, such as location.
We might expect that a verified host has more credibility and can thus charge a higher price. Oddly, there is a very weak negative correlation of -0.009. Again, numerous confounders stop us from making conclusive statements. Perhaps controlling for other variables in a regression would reveal a positive relationship.
#Creation of variable price_4_nights
new_listings <- new_listings %>%
filter(accommodates > 1) %>%
mutate(price_4_nights = price * 4)
# creating a new variable called `neighbourhood_simplified` for later regression
new_listings <- new_listings %>%
mutate(neighbourhood_simplified = case_when(neighbourhood_cleansed %in% c("Jette","Berchem-Sainte-Agathe","Koekelberg", "Molenbeek-Saint-Jean",
"Ganshoren") ~ "North West",
neighbourhood_cleansed %in% c( "Saint-Josse-ten-Noode", "Schaerbeek", "Bruxelles", "Evere") ~ "North East",
neighbourhood_cleansed %in% c("Woluwe-Saint-Lambert", "Woluwe-Saint-Pierre","Auderghem", "Etterbeek") ~ "East/Centre",
neighbourhood_cleansed %in% c("Saint-Gilles", "Anderlecht", "Forest") ~ " West/Centre",
neighbourhood_cleansed %in% c("Ixelles", "Uccle", "Watermael-Boitsfort") ~ "South/Centre"))
#Creation of new variable log_price_4_nights
new_listings <- new_listings %>%
mutate(log_price_4_nights = log(price_4_nights))
#Creating a histogram to examine distribution of price_4_nights
ggplot(data = new_listings, aes(x = price_4_nights)) +
geom_histogram(color = "white", fill = "steelblue") +
theme_bw() +
labs(title = "Distribution of price_4_nights in histogram graph",
x = "price_4_nights",
y = "")
#Creating a histogram to examine distribution of log(price_4_nights)
ggplot(data = new_listings, aes(x = log_price_4_nights)) +
geom_histogram(color = "white", fill = "steelblue") +
theme_bw() +
labs(title = "Distribution of log(price_4_nights) in histogram graph",
x = "log(price_4_nights)",
y = "")
We should use log(price_4_nights) as the response in the model, as its distribution is much closer to a normal distribution than that of price_4_nights.
library(rsample)
set.seed(1234)
#new_listings <- new_listings %>% na.omit() #drop na
train_test_split <- initial_split(new_listings, prop = 0.7)
train_data <- training(train_test_split)
test_data <- testing(train_test_split)
#checking the types of prop_type_simplified
new_listings %>%
group_by(prop_type_simplified) %>%
summarise(n = n()) %>%
arrange(desc(n))

| prop_type_simplified | n |
|---|---|
| Entire rental unit | 2065 |
| Other | 906 |
| Private room in rental unit | 451 |
| Entire condominium (condo) | 205 |
| Private room in residential home | 166 |
# Fit regression model
model1 <-lm(log(price_4_nights) ~ prop_type_simplified + number_of_reviews + review_scores_rating, data = train_data)
msummary(model1)
Estimate Std. Error
(Intercept) 5.8825214 0.0956056
prop_type_simplifiedEntire rental unit -0.0170932 0.0522687
prop_type_simplifiedOther 0.1614693 0.0551674
prop_type_simplifiedPrivate room in rental unit -0.5634344 0.0598195
prop_type_simplifiedPrivate room in residential home -0.4304863 0.0718322
number_of_reviews -0.0006096 0.0001336
review_scores_rating -0.0239373 0.0174873
t value Pr(>|t|)
(Intercept) 61.529 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.327 0.74368
prop_type_simplifiedOther 2.927 0.00346 **
prop_type_simplifiedPrivate room in rental unit -9.419 < 2e-16 ***
prop_type_simplifiedPrivate room in residential home -5.993 2.39e-09 ***
number_of_reviews -4.563 5.32e-06 ***
review_scores_rating -1.369 0.17118
Residual standard error: 0.5301 on 2279 degrees of freedom
(369 observations deleted due to missingness)
Multiple R-squared: 0.1565, Adjusted R-squared: 0.1542
F-statistic: 70.45 on 6 and 2279 DF, p-value: < 2.2e-16
autoplot(model1)
Interpretation of review_scores_rating in terms of price_4_nights. At first glance, there is a negative relationship between review_scores_rating and price_4_nights, which seems strange, given that we would normally expect properties with higher ratings to command higher prices. However, the estimated coefficient is close to zero (-0.024) and is not statistically significant (p = 0.171), so at the 5% significance level we cannot reject the hypothesis that review_scores_rating has no effect on price_4_nights.
Interpretation of prop_type_simplified in terms of price_4_nights. prop_type_simplified is a categorical variable, so the first thing to understand is that this regression uses ‘Entire condominium (condo)’ as the baseline. The intercept means that an entire condominium (condo), with the other predictors set to zero, commands a log price_4_nights of 5.883. For a private room in a rental unit or a private room in a residential home, the log price is lower by 0.563 and 0.430 respectively. This makes sense, as renting a room should cost less than renting an entire condo.
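Because the response is on a log scale, the coefficients are easier to read once converted to percentage changes. The base R sketch below uses the two estimates from the model1 summary above; the object names `log_coefs` and `pct_change` are introduced here for illustration and do not appear in the original analysis.

```r
# Coefficients copied from the model1 summary above (log-price scale).
log_coefs <- c(
  "Private room in rental unit"      = -0.5634344,
  "Private room in residential home" = -0.4304863
)

# A coefficient b on log(price) implies a multiplicative change of exp(b),
# i.e. a percentage change of 100 * (exp(b) - 1) relative to the baseline.
pct_change <- 100 * (exp(log_coefs) - 1)
print(round(pct_change, 1))

# Baseline 4-night price implied by the intercept (entire condo, with the
# numeric predictors set to zero), back on the original price scale.
baseline_price <- exp(5.8825214)
print(baseline_price)
```

Read this way, the two private-room categories are roughly 43% and 35% cheaper than an entire condo, holding the other predictors fixed.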
# testing overfit
RMSE_model1 <- test_data %>%
mutate(predictions = predict(model1, .),
R = predictions - log_price_4_nights) %>% # `.` passes the piped test data to predict()
select(R) %>%
na.omit() %>% # omit all the NA values in residual
summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
pull()
RMSE_model1
[1] 0.5181594
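The same out-of-sample RMSE calculation is repeated for every model, so it could be wrapped in a small helper. The sketch below is one possible way to do this in base R; the function name `rmse_on` is hypothetical, and the demonstration uses the built-in `mtcars` data rather than the Airbnb listings.

```r
# Hypothetical helper: root mean squared error of a fitted model on new data.
# `outcome` is the name of the response column in `data`.
rmse_on <- function(model, data, outcome) {
  errors <- predict(model, newdata = data) - data[[outcome]]
  # Rows where a prediction cannot be made (missing predictors) are skipped.
  sqrt(mean(errors^2, na.rm = TRUE))
}

# Toy demonstration on the built-in mtcars data, not the Airbnb listings.
set.seed(1234)
train_rows <- sample(nrow(mtcars), size = 22)
toy_train  <- mtcars[train_rows, ]
toy_test   <- mtcars[-train_rows, ]

toy_model <- lm(mpg ~ wt + hp, data = toy_train)
toy_rmse  <- rmse_on(toy_model, toy_test, "mpg")
print(toy_rmse)
```

Under these assumptions, a call such as `rmse_on(model1, test_data, "log_price_4_nights")` would stand in for the pipeline above.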
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. Fit a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
Since review_scores_rating is not a significant variable, we do not include it in this regression model.
# Fit regression model
model2 <-lm (log_price_4_nights ~ prop_type_simplified + number_of_reviews + room_type, data = train_data)
msummary(model2)
Estimate Std. Error
(Intercept) 5.7926328 0.0447861
prop_type_simplifiedEntire rental unit -0.0179112 0.0464776
prop_type_simplifiedOther 0.4798858 0.0526092
prop_type_simplifiedPrivate room in rental unit 0.1145401 0.0673437
prop_type_simplifiedPrivate room in residential home 0.2094071 0.0763455
number_of_reviews -0.0008301 0.0001243
room_typeHotel room -0.1110830 0.0968612
room_typePrivate room -0.6535518 0.0416335
room_typeShared room -1.2191422 0.1491773
t value Pr(>|t|)
(Intercept) 129.340 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.385 0.69999
prop_type_simplifiedOther 9.122 < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 1.701 0.08909 .
prop_type_simplifiedPrivate room in residential home 2.743 0.00613 **
number_of_reviews -6.676 2.99e-11 ***
room_typeHotel room -1.147 0.25156
room_typePrivate room -15.698 < 2e-16 ***
room_typeShared room -8.172 4.63e-16 ***
Residual standard error: 0.5075 on 2646 degrees of freedom
Multiple R-squared: 0.2394, Adjusted R-squared: 0.2371
F-statistic: 104.1 on 8 and 2646 DF, p-value: < 2.2e-16
Except for room_typeHotel room, the room_type levels are significant predictors of price, as seen from their t-statistics.
# testing overfit
RMSE_model2 <- test_data %>%
mutate(predictions = predict(model2, .),
R = predictions - log_price_4_nights) %>% # `.` passes the piped test data to predict()
select(R) %>%
na.omit() %>% # omit all the NA values in residual
summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
pull()
RMSE_model2
[1] 0.5041794
autoplot(model2)
# Fit regression model
model3 <-lm (log(price_4_nights) ~ prop_type_simplified + number_of_reviews + room_type + bathrooms + bedrooms + beds + accommodates , data = train_data)
msummary(model3)
Estimate Std. Error
(Intercept) 5.2901290 0.0508939
prop_type_simplifiedEntire rental unit -0.0074623 0.0463288
prop_type_simplifiedOther 0.2371531 0.0523933
prop_type_simplifiedPrivate room in rental unit -0.1082859 0.0646830
prop_type_simplifiedPrivate room in residential home -0.0282529 0.0725612
number_of_reviews -0.0007703 0.0001178
room_typeHotel room 0.2332800 0.0990694
room_typePrivate room -0.3025114 0.0412916
room_typeShared room -0.8132468 0.1340096
bathrooms 0.0551362 0.0181127
bedrooms 0.0363091 0.0124805
beds -0.0059722 0.0119793
accommodates 0.1202730 0.0093949
t value Pr(>|t|)
(Intercept) 103.944 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.161 0.87205
prop_type_simplifiedOther 4.526 6.30e-06 ***
prop_type_simplifiedPrivate room in rental unit -1.674 0.09425 .
prop_type_simplifiedPrivate room in residential home -0.389 0.69704
number_of_reviews -6.539 7.59e-11 ***
room_typeHotel room 2.355 0.01862 *
room_typePrivate room -7.326 3.24e-13 ***
room_typeShared room -6.069 1.50e-09 ***
bathrooms 3.044 0.00236 **
bedrooms 2.909 0.00366 **
beds -0.499 0.61815
accommodates 12.802 < 2e-16 ***
Residual standard error: 0.4511 on 2333 degrees of freedom
(309 observations deleted due to missingness)
Multiple R-squared: 0.4159, Adjusted R-squared: 0.4129
F-statistic: 138.4 on 12 and 2333 DF, p-value: < 2.2e-16
car::vif(model3)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.111669 4 1.193307
number_of_reviews 1.009273 1 1.004626
room_type 4.278923 3 1.274155
bathrooms 1.577555 1 1.256008
bedrooms 1.871598 1 1.368064
beds 3.049900 1 1.746396
accommodates 3.438294 1 1.854264
Are bathrooms, bedrooms, beds, or size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables? The number of beds is not a significant predictor of price_4_nights (p = 0.618). However, the number of bedrooms, the number of bathrooms and the size of the house (accommodates) are significant predictors. Given that every VIF is below 5, multicollinearity does not appear to be an issue.
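To make the VIF figures less of a black box, the sketch below computes a variance inflation factor by hand for numeric predictors, using the definition VIF = 1 / (1 - R²) from an auxiliary regression of each predictor on the others. It uses the built-in `mtcars` data, and the function name `vif_by_hand` is hypothetical; it does not reproduce the generalised VIF that `car::vif` reports for categorical terms.

```r
# VIF for predictor j is 1 / (1 - R2_j), where R2_j comes from regressing
# predictor j on all the other predictors in the model.
vif_by_hand <- function(predictor, others, data) {
  aux_formula <- reformulate(others, response = predictor)
  r_squared   <- summary(lm(aux_formula, data = data))$r.squared
  1 / (1 - r_squared)
}

# Toy set of numeric predictors from mtcars (not the Airbnb listings).
predictors <- c("wt", "hp", "disp")
toy_vif <- sapply(predictors, function(p) {
  vif_by_hand(p, setdiff(predictors, p), mtcars)
})
print(round(toy_vif, 2))
```

A value near 1 means a predictor is almost unrelated to the others, while large values indicate that it is largely explained by them.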
# testing overfit
RMSE_model3 <- test_data %>%
mutate(predictions = predict(model3, .),
R = predictions - log_price_4_nights) %>% # `.` passes the piped test data to predict()
select(R) %>%
na.omit() %>% # omit all the NA values in residual
summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
pull()
RMSE_model3
[1] 0.441413
autoplot(model3)
car::vif(model4)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.306063 4 1.200218
review_scores_rating 1.049621 1 1.024510
number_of_reviews 1.050964 1 1.025165
room_type 4.404961 3 1.280335
bathrooms 1.559848 1 1.248939
bedrooms 1.776568 1 1.332879
accommodates 1.868890 1 1.367073
host_is_superhost 1.091315 1 1.044660
# testing overfit
RMSE_model4 <- test_data %>%
mutate(predictions = predict(model4, .),
R = predictions - log_price_4_nights) %>% # `.` passes the piped test data to predict()
select(R) %>%
na.omit() %>% # omit all the NA values in residual
summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
pull()
RMSE_model4
[1] 0.4181949
At first glance, being a superhost seems to command a pricing premium. However, the coefficient is not statistically significant, so at the 5% significance level we cannot conclude that being a superhost commands a pricing premium.
autoplot(model4)
# Fit regression model
model5 <-lm (log(price_4_nights) ~ prop_type_simplified + number_of_reviews + room_type + bathrooms + bedrooms +
accommodates + instant_bookable , data = train_data)
msummary(model5)
Estimate Std. Error
(Intercept) 5.2426611 0.0518372
prop_type_simplifiedEntire rental unit 0.0090906 0.0464155
prop_type_simplifiedOther 0.2419394 0.0524729
prop_type_simplifiedPrivate room in rental unit -0.0826566 0.0649542
prop_type_simplifiedPrivate room in residential home 0.0009518 0.0727467
number_of_reviews -0.0008397 0.0001188
room_typeHotel room 0.1813531 0.1001331
room_typePrivate room -0.3121019 0.0414436
room_typeShared room -0.8107628 0.1346822
bathrooms 0.0585907 0.0181705
bedrooms 0.0355044 0.0123689
accommodates 0.1175597 0.0069911
instant_bookable 0.0921779 0.0196610
t value Pr(>|t|)
(Intercept) 101.137 < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.196 0.84474
prop_type_simplifiedOther 4.611 4.23e-06 ***
prop_type_simplifiedPrivate room in rental unit -1.273 0.20331
prop_type_simplifiedPrivate room in residential home 0.013 0.98956
number_of_reviews -7.069 2.05e-12 ***
room_typeHotel room 1.811 0.07025 .
room_typePrivate room -7.531 7.15e-14 ***
room_typeShared room -6.020 2.02e-09 ***
bathrooms 3.224 0.00128 **
bedrooms 2.870 0.00414 **
accommodates 16.816 < 2e-16 ***
instant_bookable 4.688 2.91e-06 ***
Residual standard error: 0.4534 on 2350 degrees of freedom
(292 observations deleted due to missingness)
Multiple R-squared: 0.4158, Adjusted R-squared: 0.4128
F-statistic: 139.4 on 12 and 2350 DF, p-value: < 2.2e-16
car::vif(model5)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.218653 4 1.197145
number_of_reviews 1.017779 1 1.008850
room_type 4.353848 3 1.277847
bathrooms 1.573146 1 1.254251
bedrooms 1.822213 1 1.349894
accommodates 1.890973 1 1.375126
instant_bookable 1.053686 1 1.026492
# testing overfit
RMSE_model5 <- test_data %>%
mutate(predictions = predict(model5, .),
R = predictions - log_price_4_nights) %>% # `.` passes the piped test data to predict()
select(R) %>%
na.omit() %>% # omit all the NA values in residual
summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
pull()
RMSE_model5
[1] 0.4411484
autoplot(model5)
Is instant_bookable a significant predictor of price_4_nights? instant_bookable is a significant predictor of price, as seen from its t-statistic (4.688, p < 0.001).
We have a member of our study group from Brussels. He suggests we group neighbourhoods into ‘North West’, ‘North East’, ‘East’, ‘West’ and ‘South’.
new_listings <- new_listings %>%
  mutate(neighbourhood_simplified = case_when(
    neighbourhood_cleansed %in% c("Jette", "Berchem-Sainte-Agathe", "Koekelberg", "Molenbeek-Saint-Jean",
                                  "Ganshoren") ~ "North West",
    neighbourhood_cleansed %in% c("Saint-Josse-ten-Noode", "Schaerbeek", "Bruxelles", "Evere") ~ "North East",
    neighbourhood_cleansed %in% c("Woluwe-Saint-Lambert", "Woluwe-Saint-Pierre", "Auderghem", "Etterbeek") ~ "East/Centre",
    # the leading space in " West/Centre" is kept as in the original run;
    # this group is the reference level in the regression output below
    neighbourhood_cleansed %in% c("Saint-Gilles", "Anderlecht", "Forest") ~ " West/Centre",
    neighbourhood_cleansed %in% c("Ixelles", "Uccle", "Watermael-Boitsfort") ~ "South/Centre"))
# Fit regression model
model6 <- lm(log_price_4_nights ~ prop_type_simplified + number_of_reviews + room_type + bathrooms + bedrooms +
               accommodates + instant_bookable + neighbourhood_simplified, data = train_data)
msummary(model6)
Estimate Std. Error
(Intercept) 5.1873923 0.0545054
prop_type_simplifiedEntire rental unit 0.0044044 0.0459951
prop_type_simplifiedOther 0.2381880 0.0519935
prop_type_simplifiedPrivate room in rental unit -0.0954069 0.0644310
prop_type_simplifiedPrivate room in residential home 0.0063421 0.0722624
number_of_reviews -0.0008850 0.0001181
room_typeHotel room 0.1656419 0.0994094
room_typePrivate room -0.2975245 0.0411531
room_typeShared room -0.8018106 0.1334929
bathrooms 0.0578702 0.0180054
bedrooms 0.0377215 0.0122794
accommodates 0.1164714 0.0069395
instant_bookable 0.0829841 0.0195798
neighbourhood_simplifiedEast/Centre 0.0077999 0.0373416
neighbourhood_simplifiedNorth East 0.1231149 0.0254769
neighbourhood_simplifiedNorth West -0.0983502 0.0422654
neighbourhood_simplifiedSouth/Centre 0.0701086 0.0288843
t value Pr(>|t|)
(Intercept) 95.172 < 2e-16 ***
prop_type_simplifiedEntire rental unit 0.096 0.92372
prop_type_simplifiedOther 4.581 4.87e-06 ***
prop_type_simplifiedPrivate room in rental unit -1.481 0.13880
prop_type_simplifiedPrivate room in residential home 0.088 0.93007
number_of_reviews -7.496 9.25e-14 ***
room_typeHotel room 1.666 0.09580 .
room_typePrivate room -7.230 6.52e-13 ***
room_typeShared room -6.006 2.19e-09 ***
bathrooms 3.214 0.00133 **
bedrooms 3.072 0.00215 **
accommodates 16.784 < 2e-16 ***
instant_bookable 4.238 2.34e-05 ***
neighbourhood_simplifiedEast/Centre 0.209 0.83456
neighbourhood_simplifiedNorth East 4.832 1.44e-06 ***
neighbourhood_simplifiedNorth West -2.327 0.02005 *
neighbourhood_simplifiedSouth/Centre 2.427 0.01529 *
Residual standard error: 0.4491 on 2346 degrees of freedom
(292 observations deleted due to missingness)
Multiple R-squared: 0.4277, Adjusted R-squared: 0.4238
F-statistic: 109.6 on 16 and 2346 DF, p-value: < 2.2e-16
car::vif(model6)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.278407 4 1.199251
number_of_reviews 1.024343 1 1.012098
room_type 4.398691 3 1.280031
bathrooms 1.574201 1 1.254672
bedrooms 1.830238 1 1.352863
accommodates 1.898747 1 1.377950
instant_bookable 1.064959 1 1.031968
neighbourhood_simplified 1.059886 4 1.007297
# testing overfit
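RMSE_model6 <- test_data %>%
  mutate(predictions = predict(model6, .), # was predict(model5, .); model6 is the model being tested here
         R = predictions - log_price_4_nights) %>% # residuals on the log scale
  select(R) %>%
  na.omit() %>% # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
  pull()
# The value printed below was obtained with predict(model5, .), which is why it
# equals RMSE_model5; it needs to be recomputed with model6.
RMSE_model6
[1] 0.4411484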
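Because the same RMSE pipeline is repeated for every model, a small helper function is one way to reduce copy-paste slips. The sketch below is illustrative only: `rmse_on` is a hypothetical helper that is not part of the original analysis, and the data are simulated rather than taken from the Airbnb listings.

```r
# Hypothetical helper: test-set RMSE of a fitted lm.
# Rows with a missing prediction or outcome are dropped, like na.omit() above.
rmse_on <- function(model, newdata, outcome) {
  residuals <- predict(model, newdata = newdata) - newdata[[outcome]]
  sqrt(mean(residuals^2, na.rm = TRUE))
}

# Simulated stand-in for the train/test split (not the real listings)
set.seed(1)
n <- 200
sim <- data.frame(x = rnorm(n))
sim$y <- 5 + 0.3 * sim$x + rnorm(n, sd = 0.4)
train_sim <- sim[1:150, ]
test_sim  <- sim[151:200, ]

fit_small <- lm(y ~ 1, data = train_sim)
fit_full  <- lm(y ~ x, data = train_sim)

# The model is passed explicitly, so each RMSE is tied to the right fit
rmse_small <- rmse_on(fit_small, test_sim, "y")
rmse_full  <- rmse_on(fit_full, test_sim, "y")
c(small = rmse_small, full = rmse_full)
```

With the real objects, a call such as `rmse_on(model6, test_data, "log_price_4_nights")` would play the role of the pipeline above.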
autoplot(model6)
Location is a significant predictor of price_4_nights, as seen by the t-statistics. Relative to the West/Centre baseline, listings in East/Centre are not priced significantly differently, listings in North East and South/Centre are significantly more expensive, and listings in North West are significantly cheaper.
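Since the outcome is a log price, the neighbourhood coefficients are easier to read as percentage differences from the reference group. The following sketch converts the model6 estimates reported above; the vector name `coefs` is just for illustration.

```r
# Neighbourhood coefficients from the model6 table above (log scale)
coefs <- c("East/Centre"  =  0.0077999,
           "North East"   =  0.1231149,
           "North West"   = -0.0983502,
           "South/Centre" =  0.0701086)

# With a log outcome, a dummy coefficient b implies a price ratio of exp(b)
# relative to the reference group, i.e. a change of 100 * (exp(b) - 1) percent.
pct_change <- 100 * (exp(coefs) - 1)
round(pct_change, 1)
```

Read this way, North East listings are roughly 13% more expensive than the reference group and North West listings roughly 9% cheaper, other variables held fixed.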
What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
# Fit regression model
model7 <-lm (log_price_4_nights ~ prop_type_simplified + number_of_reviews + room_type + bathrooms + bedrooms +
accommodates + instant_bookable + neighbourhood_simplified + reviews_per_month + availability_30 , data = train_data)
# Get regression table:
msummary(model7)
Estimate Std. Error
(Intercept) 5.151e+00 5.247e-02
prop_type_simplifiedEntire rental unit -1.150e-02 4.381e-02
prop_type_simplifiedOther 1.595e-01 4.944e-02
prop_type_simplifiedPrivate room in rental unit -7.750e-02 6.209e-02
prop_type_simplifiedPrivate room in residential home 4.815e-02 6.903e-02
number_of_reviews 6.408e-05 1.343e-04
room_typeHotel room 9.306e-02 9.790e-02
room_typePrivate room -4.225e-01 4.001e-02
room_typeShared room -8.448e-01 1.178e-01
bathrooms 3.724e-02 1.640e-02
bedrooms 3.340e-02 1.102e-02
accommodates 1.200e-01 6.486e-03
instant_bookable 6.457e-02 1.877e-02
neighbourhood_simplifiedEast/Centre 1.076e-03 3.540e-02
neighbourhood_simplifiedNorth East 9.706e-02 2.435e-02
neighbourhood_simplifiedNorth West -1.475e-01 4.062e-02
neighbourhood_simplifiedSouth/Centre 5.262e-02 2.728e-02
reviews_per_month -5.345e-02 6.338e-03
availability_30 1.628e-02 8.694e-04
t value Pr(>|t|)
(Intercept) 98.178 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.263 0.792948
prop_type_simplifiedOther 3.226 0.001274 **
prop_type_simplifiedPrivate room in rental unit -1.248 0.212070
prop_type_simplifiedPrivate room in residential home 0.698 0.485547
number_of_reviews 0.477 0.633356
room_typeHotel room 0.951 0.341927
room_typePrivate room -10.560 < 2e-16 ***
room_typeShared room -7.173 1.03e-12 ***
bathrooms 2.270 0.023315 *
bedrooms 3.031 0.002472 **
accommodates 18.497 < 2e-16 ***
instant_bookable 3.441 0.000592 ***
neighbourhood_simplifiedEast/Centre 0.030 0.975751
neighbourhood_simplifiedNorth East 3.986 6.97e-05 ***
neighbourhood_simplifiedNorth West -3.631 0.000290 ***
neighbourhood_simplifiedSouth/Centre 1.929 0.053932 .
reviews_per_month -8.433 < 2e-16 ***
availability_30 18.729 < 2e-16 ***
Residual standard error: 0.393 on 2006 degrees of freedom
(630 observations deleted due to missingness)
Multiple R-squared: 0.5569, Adjusted R-squared: 0.553
F-statistic: 140.1 on 18 and 2006 DF, p-value: < 2.2e-16
In this model, number_of_reviews is no longer significant (p = 0.63). This is probably because reviews_per_month carries much of the same information as number_of_reviews. We therefore replace number_of_reviews with review_scores_rating, which is significant, and refit the model.
# Fit regression model
model7 <-lm (log_price_4_nights ~ prop_type_simplified + review_scores_rating + room_type + bathrooms + bedrooms +
accommodates + instant_bookable + neighbourhood_simplified + reviews_per_month + availability_30 , data = train_data)
# Get regression table:
msummary(model7)
Estimate Std. Error
(Intercept) 4.9696786 0.0839576
prop_type_simplifiedEntire rental unit -0.0084988 0.0436885
prop_type_simplifiedOther 0.1635574 0.0491512
prop_type_simplifiedPrivate room in rental unit -0.0685048 0.0616751
prop_type_simplifiedPrivate room in residential home 0.0507285 0.0685730
review_scores_rating 0.0383368 0.0139127
room_typeHotel room 0.0863199 0.0977376
room_typePrivate room -0.4252994 0.0398456
room_typeShared room -0.8333446 0.1175003
bathrooms 0.0366305 0.0163751
bedrooms 0.0340481 0.0110024
accommodates 0.1202994 0.0064751
instant_bookable 0.0677469 0.0187699
neighbourhood_simplifiedEast/Centre 0.0003808 0.0353283
neighbourhood_simplifiedNorth East 0.0969861 0.0243075
neighbourhood_simplifiedNorth West -0.1481909 0.0405403
neighbourhood_simplifiedSouth/Centre 0.0494046 0.0272550
reviews_per_month -0.0533501 0.0050452
availability_30 0.0166477 0.0008780
t value Pr(>|t|)
(Intercept) 59.193 < 2e-16 ***
prop_type_simplifiedEntire rental unit -0.195 0.845778
prop_type_simplifiedOther 3.328 0.000892 ***
prop_type_simplifiedPrivate room in rental unit -1.111 0.266815
prop_type_simplifiedPrivate room in residential home 0.740 0.459524
review_scores_rating 2.756 0.005913 **
room_typeHotel room 0.883 0.377245
room_typePrivate room -10.674 < 2e-16 ***
room_typeShared room -7.092 1.82e-12 ***
bathrooms 2.237 0.025399 *
bedrooms 3.095 0.001998 **
accommodates 18.579 < 2e-16 ***
instant_bookable 3.609 0.000314 ***
neighbourhood_simplifiedEast/Centre 0.011 0.991402
neighbourhood_simplifiedNorth East 3.990 6.85e-05 ***
neighbourhood_simplifiedNorth West -3.655 0.000263 ***
neighbourhood_simplifiedSouth/Centre 1.813 0.070031 .
reviews_per_month -10.574 < 2e-16 ***
availability_30 18.960 < 2e-16 ***
Residual standard error: 0.3923 on 2006 degrees of freedom
(630 observations deleted due to missingness)
Multiple R-squared: 0.5586, Adjusted R-squared: 0.5546
F-statistic: 141 on 18 and 2006 DF, p-value: < 2.2e-16
car::vif(model7)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 4.437179 4 1.204726
review_scores_rating 1.057403 1 1.028301
room_type 4.548614 3 1.287201
bathrooms 1.564861 1 1.250944
bedrooms 1.786566 1 1.336625
accommodates 1.878203 1 1.370475
instant_bookable 1.092379 1 1.045169
neighbourhood_simplified 1.090815 4 1.010925
reviews_per_month 1.093100 1 1.045514
availability_30 1.106151 1 1.051737
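For factors with several levels, `car::vif()` reports a generalised VIF together with `GVIF^(1/(2*Df))`, which adjusts for the number of coefficients in the term. As a rough check, and assuming the common rule of thumb that the squared adjusted value can be compared with the usual VIF threshold of 5, the last column can be reproduced from the first two:

```r
# GVIF and Df for the two multi-level factors in model7 (copied from the output above)
gvif <- c(prop_type_simplified = 4.437179, room_type = 4.548614)
df   <- c(prop_type_simplified = 4,        room_type = 3)

# Adjusted value printed by car::vif(): GVIF^(1 / (2 * Df))
adj_gvif <- gvif^(1 / (2 * df))

# Squaring puts the adjusted value on the scale of an ordinary VIF
adj_gvif_sq <- adj_gvif^2
round(rbind(adjusted = adj_gvif, squared = adj_gvif_sq), 3)
```

Both squared values stay well below 5, which is consistent with collinearity not being a concern in this model.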
# testing overfit
RMSE_model7 <- test_data %>%
  mutate(predictions = predict(model7, .), # "." passes the piped test_data as the new data
         R = predictions - log_price_4_nights) %>% # residuals on the log scale
  select(R) %>%
  na.omit() %>% # omit all the NA values in residual
  summarise(RMSE = sqrt(sum(R**2 / n()))) %>%
  pull()
RMSE_model7
[1] 0.3703
autoplot(model7)
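availability_30 and reviews_per_month are both significant predictors of price_4_nights. availability_30 has a positive effect (0.017 on the log price per additional available day), while reviews_per_month has a negative effect (-0.053 on the log price per additional review per month).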
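To give these two effects a sense of scale and uncertainty, the estimates and standard errors from the final model7 table can be turned into 95% confidence intervals and expressed as percentage changes in price. This is a sketch that only uses the numbers printed above; the object names are illustrative.

```r
# Estimates and standard errors from the final model7 table above (log scale)
est <- c(reviews_per_month = -0.0533501, availability_30 = 0.0166477)
se  <- c(reviews_per_month =  0.0050452, availability_30 = 0.0008780)
df_resid <- 2006  # residual degrees of freedom reported for model7

# 95% confidence interval on the log scale
t_crit <- qt(0.975, df_resid)
ci_log <- cbind(lower = est - t_crit * se,
                estimate = est,
                upper = est + t_crit * se)

# Convert to percentage change in price per one-unit increase
ci_pct <- 100 * (exp(ci_log) - 1)
round(ci_pct, 2)
```

Neither interval contains zero, which matches the very small p-values in the regression table.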
library(huxtable) # provides huxreg() and set_caption()
huxreg(list('model1' = model1,
'model2' = model2,
'model3' = model3,
'model4' = model4,
'model5' = model5,
'model6' = model6,
'model7' = model7),
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05,
stars = NULL
) %>%
  set_caption('Comparison of models')

| | model1 | model2 | model3 | model4 | model5 | model6 | model7 |
|---|---|---|---|---|---|---|---|
| (Intercept) | 5.883 | 5.793 | 5.290 | 5.331 | 5.243 | 5.187 | 4.970 |
| | (0.096) | (0.045) | (0.051) | (0.089) | (0.052) | (0.055) | (0.084) |
| prop_type_simplifiedEntire rental unit | -0.017 | -0.018 | -0.007 | 0.005 | 0.009 | 0.004 | -0.008 |
| | (0.052) | (0.046) | (0.046) | (0.049) | (0.046) | (0.046) | (0.044) |
| prop_type_simplifiedOther | 0.161 | 0.480 | 0.237 | 0.234 | 0.242 | 0.238 | 0.164 |
| | (0.055) | (0.053) | (0.052) | (0.055) | (0.052) | (0.052) | (0.049) |
| prop_type_simplifiedPrivate room in rental unit | -0.563 | 0.115 | -0.108 | -0.034 | -0.083 | -0.095 | -0.069 |
| | (0.060) | (0.067) | (0.065) | (0.069) | (0.065) | (0.064) | (0.062) |
| prop_type_simplifiedPrivate room in residential home | -0.430 | 0.209 | -0.028 | 0.086 | 0.001 | 0.006 | 0.051 |
| | (0.072) | (0.076) | (0.073) | (0.076) | (0.073) | (0.072) | (0.069) |
| number_of_reviews | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | -0.001 | |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | |
| review_scores_rating | -0.024 | | | -0.016 | | | 0.038 |
| | (0.017) | | | (0.015) | | | (0.014) |
| room_typeHotel room | | -0.111 | 0.233 | 0.290 | 0.181 | 0.166 | 0.086 |
| | | (0.097) | (0.099) | (0.108) | (0.100) | (0.099) | (0.098) |
| room_typePrivate room | | -0.654 | -0.303 | -0.381 | -0.312 | -0.298 | -0.425 |
| | | (0.042) | (0.041) | (0.044) | (0.041) | (0.041) | (0.040) |
| room_typeShared room | | -1.219 | -0.813 | -0.786 | -0.811 | -0.802 | -0.833 |
| | | (0.149) | (0.134) | (0.131) | (0.135) | (0.133) | (0.118) |
| bathrooms | | | 0.055 | 0.039 | 0.059 | 0.058 | 0.037 |
| | | | (0.018) | (0.018) | (0.018) | (0.018) | (0.016) |
| bedrooms | | | 0.036 | 0.029 | 0.036 | 0.038 | 0.034 |
| | | | (0.012) | (0.012) | (0.012) | (0.012) | (0.011) |
| beds | | | -0.006 | | | | |
| | | | (0.012) | | | | |
| accommodates | | | 0.120 | 0.128 | 0.118 | 0.116 | 0.120 |
| | | | (0.009) | (0.007) | (0.007) | (0.007) | (0.006) |
| host_is_superhost | | | | -0.005 | | | |
| | | | | (0.024) | | | |
| instant_bookable | | | | | 0.092 | 0.083 | 0.068 |
| | | | | | (0.020) | (0.020) | (0.019) |
| neighbourhood_simplifiedEast/Centre | | | | | | 0.008 | 0.000 |
| | | | | | | (0.037) | (0.035) |
| neighbourhood_simplifiedNorth East | | | | | | 0.123 | 0.097 |
| | | | | | | (0.025) | (0.024) |
| neighbourhood_simplifiedNorth West | | | | | | -0.098 | -0.148 |
| | | | | | | (0.042) | (0.041) |
| neighbourhood_simplifiedSouth/Centre | | | | | | 0.070 | 0.049 |
| | | | | | | (0.029) | (0.027) |
| reviews_per_month | | | | | | | -0.053 |
| | | | | | | | (0.005) |
| availability_30 | | | | | | | 0.017 |
| | | | | | | | (0.001) |
| #observations | 2286 | 2655 | 2346 | 2024 | 2363 | 2363 | 2025 |
| R squared | 0.156 | 0.239 | 0.416 | 0.447 | 0.416 | 0.428 | 0.559 |
| Adj. R Squared | 0.154 | 0.237 | 0.413 | 0.444 | 0.413 | 0.424 | 0.555 |
| Residual SE | 0.530 | 0.508 | 0.451 | 0.439 | 0.453 | 0.449 | 0.392 |
RMSE in the testing dataset
# data_frame() is deprecated; tibble() is its replacement
tibble(RMSE_model1, RMSE_model2, RMSE_model3, RMSE_model4, RMSE_model5,
       RMSE_model6, RMSE_model7)

| RMSE_model1 | RMSE_model2 | RMSE_model3 | RMSE_model4 | RMSE_model5 | RMSE_model6 | RMSE_model7 |
|---|---|---|---|---|---|---|
| 0.518 | 0.504 | 0.441 | 0.418 | 0.441 | 0.441 | 0.37 |
Model 7 has the highest adjusted R^2 (0.555) and the lowest RMSE on the testing set (0.370). Its test RMSE is also no higher than its residual standard error on the training set (0.392), so there is no sign of overfitting. We therefore use model7 for prediction.
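Adjusted R^2 penalises R^2 for the number of predictors, which is why it is the fairer yardstick when models of different sizes are compared. As a quick consistency check, the reported values for model6 and model7 can be reproduced from R^2, the number of observations and the number of slope coefficients; `adj_r2` is a hypothetical helper written for this check.

```r
# Adjusted R-squared from R-squared, sample size n and number of slope terms p
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# model6: R^2 = 0.4277, 2363 observations, 16 slope coefficients
adj_model6 <- adj_r2(0.4277, n = 2363, p = 16)
# model7: R^2 = 0.5586, 2025 observations, 18 slope coefficients
adj_model7 <- adj_r2(0.5586, n = 2025, p = 18)

round(c(model6 = adj_model6, model7 = adj_model7), 4)
```

The results agree with the adjusted R^2 values of 0.4238 and 0.5546 printed in the regression summaries.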
data <- new_listings %>%
  filter(review_scores_rating >= 4.5,
         prop_type_simplified == 'Private room in rental unit',
         room_type == 'Private room',
         number_of_reviews >= 10,
         neighbourhood_simplified == 'North West')
data %>%
  mutate(predictions = predict(model7, .)) %>% # predicted log price for each matching listing
  select(predictions) %>%
  summarise(mean = mean(predictions, na.rm = TRUE),
            std = sd(predictions, na.rm = TRUE),
            count = n(),
            SE = std / sqrt(count),
            # get t-critical value with (n-1) degrees of freedom
            t = qt(0.975, count - 1),
            margin_of_error = t * SE,
            lower_CI = mean - margin_of_error,
            higher_CI = mean + margin_of_error) %>%
  select(mean, std, lower_CI, higher_CI) %>%
  # back-transform from log price to Euros; std is left on the log scale
  mutate(mean = exp(mean),
         lower_CI = exp(lower_CI),
         higher_CI = exp(higher_CI))

| mean | std | lower_CI | higher_CI |
|---|---|---|---|
| 123 | 0.112 | 116 | 132 |
Suppose we want to book a private room in a rental unit located in the North West, with at least 10 reviews and an average review score of at least 4.5. Based on the listings in our dataset that match these criteria, our point estimate of the price for 4 nights is 123.4 Euros, with a 95% confidence interval for the mean predicted price running from 115.7 to 131.6 Euros.
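The interval above describes the average predicted price across matching listings. For the price of one individual listing, a prediction interval is usually wider, because it also includes the residual variation around the regression line. The sketch below illustrates the difference with `predict()` on simulated data; the variables and coefficients are made up for the example and are not the Airbnb estimates.

```r
set.seed(42)
# Simulated stand-in for the listings (hypothetical values, not the real data)
n <- 300
sim <- data.frame(accommodates = sample(2:6, n, replace = TRUE))
sim$log_price <- 5 + 0.12 * sim$accommodates + rnorm(n, sd = 0.4)

fit <- lm(log_price ~ accommodates, data = sim)
new_listing <- data.frame(accommodates = 2)

# Interval for the average log price of listings like this one
conf_int <- predict(fit, newdata = new_listing,
                    interval = "confidence", level = 0.95)
# Interval for the log price of a single listing (adds residual variance)
pred_int <- predict(fit, newdata = new_listing,
                    interval = "prediction", level = 0.95)

# Back-transform both intervals from log price to price
round(exp(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ])), 1)
```

Note that exponentiating a mean of log prices gives a geometric mean, so the back-transformed point estimate is closer to a typical price than to an arithmetic average.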
3.1.1.1 Comments and Analysis